Testing the adaptive hypothesis of lagging-strand encoding in bacterial genomes

Genes are preferentially encoded on the leading instead of the lagging strand of DNA replication in most bacterial genomes1, likely because lagging-strand encoding is selectively disfavored. Merrikh and Merrikh, however, proposed that lagging-strand encoding is adaptive, based on their inferred gene inversions and a comparison of nucleotide substitution rates2. Here we point out methodological flaws and errors in their analyses and logical problems of their interpretation. Our new analysis of their data and analysis of other publicly available data do not support the adaptive hypothesis of lagging-strand encoding. Lagging-strand encoding can cause head-on (HO) collisions between DNA polymerases and RNA polymerases that induce transcriptional abortion, replication delay, and possibly mutagenesis1, so is expected to be deleterious relative to leadingstrand encoding. However, there are still genes encoded on the lagging strand, an observation that has been explained by a balance between deleterious mutations bringing genes from the leading to the lagging strand and purifying selection purging such mutations3,4. This mutation-selection balance hypothesis predicts that the probability that a gene is encoded on the lagging strand decreases with the detriment of its lagging-strand encoding relative to leading-strand encoding, explaining why highly expressed genes and essential genes are underrepresented on the lagging strand5,6. By contrast, Merrikh and Merrikh2 asserted that the observed lagging-strand encoding is adaptive because of beneficial mutations brought by the potentially increased mutagenesis resulting from HO collisions. Following Merrikh and Merrikh2, we refer to the leadingstrand encoded genes as co-directional (CD) genes, because the movement of DNA and RNA polymerases in these genes is CD, and refer to lagging-strand encoded genes as HO genes. Merrikh and Merrikh’s primary evidence for the adaptive hypothesis was their inference that the fraction of present-day HO genes that were previously CD exceeds the fraction of present-day CD genes that were previously HO in each of the six bacterial species examined, which they interpreted as natural selection for HO under the reasonable assumption that the inversion mutation from HO to CD and that from CD to HO have equal rates per gene. However, if this interpretation were true, the number of HO genes would gradually rise and eventually exceed the number of CD genes, contradicting the preponderance of CD genes in most bacterial genomes. The cause of the above paradox is that to infer selection, one should compare the CD-to-HO inversion rate with the HO-to-CD inversion rate. But the CD-to-HO rate does not equal the fraction of present-day HO genes that were CD, but the fraction of previously CD genes that are now HO. The same can be said for the HO-to-CD rate. Therefore, the fractions estimated by Merrikh and Merrikh have no bearing on the hypothesis being tested. Based on Merrikh and Merrikh’s estimates of previously and present-day HO genes and CD genes (data from Supplementary Table 1 in Merrikh and Merrikh2), we found that the rate of inversion from CD to HO is lower than the reverse rate in four of the six species examined (Table 1), consistent with the prediction of the mutation-selection balance hypothesis but opposite to that of the adaptive hypothesis. In fact, a lower inversion rate from CD to HO than the converse rate was reported 20 years ago in each of the four bacterial species examined then7. Merrikh and Merrikh’s mistake is puzzling given that they cited this study in their paper2. Furthermore, Merrikh and Merrikh’s inference of previously HO genes and previously CD genes2 is error-prone. For each genic region, they computed the leading-strand GC skew by the difference in frequency between guanine (G) and cytosine (C), relative to the total frequency of G and C. They assumed that a negative GC skew means that the gene has been recently inverted2, based on a tendency for the leading strand to have a positive GC skew due to mutational bias8. It should be noted that the G and C frequencies of a genic region are influenced not only by mutation but also by selection9. To the best of our knowledge, no study has used the GC skew alone to infer gene inversion in bacteria. Comparison of gene orientation determined from genome assemblies is the standard method to infer gene inversion in bacterial (and organelle) genomes, while the GC skew is sometimes used as confirmatory evidence7,10,11. Even when the GC skew is used as a confirmation, the common practice is to calculate it at third codon positions or four-fold degenerate sites, because nucleotide frequencies at these sites are subject to weaker https://doi.org/10.1038/s41467-022-30000-8 OPEN

G enes are preferentially encoded on the leading instead of the lagging strand of DNA replication in most bacterial genomes 1 , likely because lagging-strand encoding is selectively disfavored. Merrikh and Merrikh, however, proposed that lagging-strand encoding is adaptive, based on their inferred gene inversions and a comparison of nucleotide substitution rates 2 . Here we point out methodological flaws and errors in their analyses and logical problems of their interpretation. Our new analysis of their data and analysis of other publicly available data do not support the adaptive hypothesis of lagging-strand encoding.
Lagging-strand encoding can cause head-on (HO) collisions between DNA polymerases and RNA polymerases that induce transcriptional abortion, replication delay, and possibly mutagenesis 1 , so is expected to be deleterious relative to leadingstrand encoding. However, there are still genes encoded on the lagging strand, an observation that has been explained by a balance between deleterious mutations bringing genes from the leading to the lagging strand and purifying selection purging such mutations 3,4 . This mutation-selection balance hypothesis predicts that the probability that a gene is encoded on the lagging strand decreases with the detriment of its lagging-strand encoding relative to leading-strand encoding, explaining why highly expressed genes and essential genes are underrepresented on the lagging strand 5,6 . By contrast, Merrikh and Merrikh 2 asserted that the observed lagging-strand encoding is adaptive because of beneficial mutations brought by the potentially increased mutagenesis resulting from HO collisions. Following Merrikh and Merrikh 2 , we refer to the leadingstrand encoded genes as co-directional (CD) genes, because the movement of DNA and RNA polymerases in these genes is CD, and refer to lagging-strand encoded genes as HO genes. Merrikh and Merrikh's primary evidence for the adaptive hypothesis was their inference that the fraction of present-day HO genes that were previously CD exceeds the fraction of present-day CD genes that were previously HO in each of the six bacterial species examined, which they interpreted as natural selection for HO under the reasonable assumption that the inversion mutation from HO to CD and that from CD to HO have equal rates per gene. However, if this interpretation were true, the number of HO genes would gradually rise and eventually exceed the number of CD genes, contradicting the preponderance of CD genes in most bacterial genomes. The cause of the above paradox is that to infer selection, one should compare the CD-to-HO inversion rate with the HO-to-CD inversion rate. But the CD-to-HO rate does not equal the fraction of present-day HO genes that were CD, but the fraction of previously CD genes that are now HO. The same can be said for the HO-to-CD rate. Therefore, the fractions estimated by Merrikh and Merrikh have no bearing on the hypothesis being tested. Based on Merrikh and Merrikh's estimates of previously and present-day HO genes and CD genes (data from Supplementary Table 1 in Merrikh and Merrikh 2 ), we found that the rate of inversion from CD to HO is lower than the reverse rate in four of the six species examined (Table 1), consistent with the prediction of the mutation-selection balance hypothesis but opposite to that of the adaptive hypothesis. In fact, a lower inversion rate from CD to HO than the converse rate was reported 20 years ago in each of the four bacterial species examined then 7 . Merrikh and Merrikh's mistake is puzzling given that they cited this study in their paper 2 .
Furthermore, Merrikh and Merrikh's inference of previously HO genes and previously CD genes 2 is error-prone. For each genic region, they computed the leading-strand GC skew by the difference in frequency between guanine (G) and cytosine (C), relative to the total frequency of G and C. They assumed that a negative GC skew means that the gene has been recently inverted 2 , based on a tendency for the leading strand to have a positive GC skew due to mutational bias 8 . It should be noted that the G and C frequencies of a genic region are influenced not only by mutation but also by selection 9 . To the best of our knowledge, no study has used the GC skew alone to infer gene inversion in bacteria. Comparison of gene orientation determined from genome assemblies is the standard method to infer gene inversion in bacterial (and organelle) genomes, while the GC skew is sometimes used as confirmatory evidence 7,10,11 . Even when the GC skew is used as a confirmation, the common practice is to calculate it at third codon positions or four-fold degenerate sites, because nucleotide frequencies at these sites are subject to weaker protein-related selection 7,10,11 . By contrast, Merrikh and Merrikh computed the GC skew of a gene using its entire coding region, further increasing the likelihood of erroneous inferences of gene inversion.
We thus used the standard method to infer gene inversion in the six species analyzed by Merrikh and Merrikh. For each focal species A, a closely related species B from the same genus and an outgroup species C were chosen (Table 2), and orthologs among the three species were identified by reciprocal best hits from protein BLAST analysis (Supplementary Data 1). Gene orientation was determined from reference genome assemblies and inversions were inferred using the parsimony principle. We then counted the number of inversions in the lineage leading to the focal species A from the common ancestor of A and B. The results showed that the rate of CD-to-HO inversion is significantly different from the rate of HO-to-CD inversion in three of the six species. In all three cases, the former rate is lower than the latter rate (Table 3), again supporting the mutation-selection balance hypothesis but refuting the adaptive hypothesis. Under the assumption that the number of HO genes observed from a species has reached equilibrium, the number of CD-to-HO inversions should equal the number of HO-to-CD inversions. However, we observed that the former is greater than the latter in each species (Table 3). This is likely because the evolutionary time concerned (since the divergence of species A from B) is relatively short such that a sizable fraction of deleterious CD-to-HO inversions has yet   to be purged. Such a time lag in the effect of purifying selection is well known 12 . It is notable that we identified 114 HO-to-CD and 197 CD-to-HO inversions in the six species, while Merrikh and Merrikh identified an order of magnitude more inversions (1468 and 3669, respectively) in the same species 2 . Apart from the general unreliability of Merrikh and Merrikh's method, the large disparity may also be due to the fact that we aimed to detect inversions that occurred since the divergence between species A and B, while inversions detected by Merrikh and Merrikh could have occurred at any time (though older inversions have lower detectabilities). The fact that only 1 HO-to-CD and 145 CD-to-HO inversions we detected are also detected by Merrikh and Merrikh suggests that their method not only has potential falsepositive errors but also makes numerous false-negative errors. The particularly high false-negative rate in detecting HO-to-CD inversions by Merrikh and Merrikh's method is probably because G is selectively favored over C on the coding strand due to the fact that G-containing codons tend to code for amino acids with relatively low synthetic costs 9 , a confounding factor ignored in Merrikh and Merrikh's method. Another possibility is that, if a gene of species A has been inverted twice, once shortly before and once after the separation of A from B, the standard method will recognize the more recent inversion, but Merrikh and Merrikh's method will likely miss it.
In addition to analyzing gene inversions, Merrikh and Merrikh estimated the synonymous (d S ) and nonsynonymous (d N ) nucleotide substitution rates of individual genes 2 . They reported that d S is not significantly different between HO and CD genes, but d N and d N /d S are significantly higher for HO than CD genes. The d S comparison suggests that the point mutation rate is not different between HO and CD genes in coding regions, consistent with experimental data 13,14 and arguing squarely against the basis of the adaptive hypothesis that the (beneficial) mutation rate is higher for HO than CD genes. Merrikh and Merrikh interpreted the results on d N and d N /d S as evidence for positive selection of HO genes. However, higher d N and d N /d S could also result from a relaxation of purifying selection. Given that highly expressed genes and essential genes are underrepresented among HO genes, relaxation of purifying selection seems a more reasonable interpretation 15 . Similarly, the observation of a larger fraction of genes with d N /d S > 1 among HO genes than CD genes 2 can be explained by a relaxation of purifying selection on HO genes, because no statistical test was performed by Merrikh and Merrikh to show that any gene has its d N /d S significantly exceeding 1, the criterion for establishing positive selection. If such statistical tests are to be performed, corrections for multiple testing would be necessary to guard against false-positive results. Merrikh and Merrikh 2 also reported enrichment of several functional groups among HO genes relative to CD genes. This non-randomness could be a byproduct of the known differences between HO and CD genes in expression level and essentiality 5,6 , so cannot be used to support the adaptive hypothesis.
In conclusion, our reanalysis of the empirical data of Merrikh and Merrikh 2 and new analysis found no evidence for the adaptive hypothesis of lagging-strand encoding in bacterial genomes. Instead, all available data are broadly consistent with the mutation-selection balance hypothesis.

Methods
Data sources. All genome and sequence data of the species used in this study (see Table 2) were downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/ bacteria/). The reanalysis presented in Table 1 was based on the data in Supplementary Table 1 of Merrikh and Merrikh 2 .
Estimation of inversion rates. The HO-to-CD inversion rate was estimated by the number of HO-to-CD inversion events divided by the number of previously HO genes. The number of previously HO genes is the number of present-day HO genes plus the number of HO-to-CD inversion events minus the number of CD-to-HO inversion events. The CD-to-HO inversion rate was similarly estimated.
Phylogeny-based identification of gene inversions. For each focal species A, a closely related species (B) from the same genus and an outgroup species (C) were chosen. In each three-species group, protein BLAST analysis was performed to identify orthologs. Gene orientations were determined from reference genome assemblies, followed by the inference of inversion events using the parsimony principle. Specifically, in a three-species orthologous gene group, when the gene orientation is different between the two ingroup species, an inversion event is inferred for the ingroup species in which the gene orientation differs from that in the outgroup.